Stochastic Gradient Descent (SGD)#

../../_images/sgd-meme.png

Stochastic Gradient Descent is an optimization algorithm that differs from ordinary Gradient Descent in that, at each step, the gradient of the optimized function is computed not as the sum of the gradients over all sample elements, but as the gradient of a single randomly selected element.

Gradient Descent (GD) review#

Before studying stochastic gradient descent, let’s recall the formula of ordinary gradient descent. We have the following:

  • \(\boldsymbol w = {(w_1, w_2, ..., w_n)}^{T}\) is the vector of parameters, where \(n\) is its size

  • \(L\) is a differentiable multivariable function

  • \(\nabla L(\boldsymbol w)\) is the gradient of the function (its meaning is shown in (11))

  • \(Q(\boldsymbol w)\) is the function we want to minimize with respect to \(\boldsymbol w\): \(Q(\boldsymbol w) = \sum\limits_{i=1}^\ell L_i(\boldsymbol w) \to \min\limits_{\boldsymbol w}\)

  • \(\ell\) is the number of input points (summands \(L_i\))

Choose \(\boldsymbol w^{(0)}\), the initial approximation of the parameter vector. Then each subsequent parameter vector is calculated as

(15)#\[\boldsymbol w = \boldsymbol w - \eta \sum\limits_{i=1}^\ell \nabla L_i(\boldsymbol w)\]

where \(\eta\) is the gradient step, or learning rate (it controls how far the parameter vector moves in the direction opposite to the gradient).

The algorithm stops when \(Q\) or \(\boldsymbol w\) converges.
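As a concrete illustration, here is a minimal sketch of update (15) on a toy least-squares problem (the data `X`, `y` and the squared-error loss are illustrative assumptions, not from the text):

```python
import numpy as np

# Toy setup (illustrative): L_i(w) = 0.5 * (x_i . w - y_i)^2, so the
# summed gradient is X^T (X w - y).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))       # 100 input points, n = 3 parameters
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)                     # initial approximation w^(0)
eta = 0.001                         # learning rate
for _ in range(500):
    grad_sum = X.T @ (X @ w - y)    # sum of gradients over all l points
    w = w - eta * grad_sum          # formula (15)
```

Note that every iteration touches all 100 points; this is exactly the cost that SGD avoids.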

Stochastic Gradient Descent (SGD)#

GD algorithm problem#

Considering the previous algorithm (15), we run into a problem: to determine a new approximation of the vector \(\boldsymbol w\), we have to calculate the gradient of every element, i.e. \(\sum\limits_{i=1}^\ell \nabla L_i(\boldsymbol w)\). This can significantly slow down the algorithm, since it is very resource-intensive.

SGD derivation#

To improve the efficiency of the algorithm, we can replace the calculation of the true gradient over all input points with a pseudogradient computed over only one point or a sequence of points.

Pseudogradient (\(\nabla \widehat Q\)) must have the following properties:

  • It must form an acute (sharp) angle with the true gradient in \(n\)-dimensional space

  • It must be much cheaper to compute than the true gradient \(\nabla Q(\boldsymbol w)\) over all input points.

Accordingly, at each step, instead of computing the whole sum of gradients over all inputs (\(\sum\limits_{i=1}^\ell \nabla L_i(\boldsymbol w)\)), we can consider a single observation (\(\nabla L_k(\boldsymbol w)\)), i.e.

(16)#\[\boldsymbol w = \boldsymbol w - \eta \nabla L_k(\boldsymbol w)\]

where \(1 \leqslant k \leqslant \ell\) (usually chosen at random).
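A minimal sketch of update (16) on the same kind of toy least-squares loss (the data and loss are illustrative assumptions):

```python
import numpy as np

# Toy setup (illustrative): L_k(w) = 0.5 * (x_k . w - y_k)^2
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)
eta = 0.05
for _ in range(2000):
    k = rng.integers(len(X))             # random index k (0-based here)
    grad_k = (X[k] @ w - y[k]) * X[k]    # gradient of the single term L_k
    w = w - eta * grad_k                 # formula (16)
```

Each step now costs one gradient evaluation instead of \(\ell\), at the price of a noisier trajectory.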

SGD Realization Pseudocode#

Algorithm (Stochastic Gradient descent)

  • Input: convergence step (learning rate) “\(\eta\)”, rate of “forgetting” “\(\lambda\)”

  • Output: Vector of parameters \(\boldsymbol w\)

To solve the optimization problem

\[ L(\boldsymbol w) \to \min\limits_{\boldsymbol w} \]

do the following steps:

  1. Initialization of \(\boldsymbol w\) with some initial values (e.g., from \(\mathcal N(0, 1)\))

  2. Initial calculation of the function: \(\widehat Q (\boldsymbol w) = \frac 1\ell\sum\limits_{i=1}^\ell L_i(\boldsymbol w)\)

  3. While \(\widehat Q\) and/or \(\boldsymbol w\) don’t reach set values DO:

  • Choose a random index \(k\), where \(1 \leqslant k \leqslant \ell\)

  • Calculate function \(E_k = L_k(\boldsymbol w)\)

  • Recalculate the vector: \(\boldsymbol w = \boldsymbol w - \eta \nabla L_k(\boldsymbol w)\) (the rate of “forgetting” “\(\lambda\)”, used in the next step, is just a constant which is commonly very small)

  • Recalculate the function estimate: \(\widehat Q = \lambda E_k + (1-\lambda)\widehat Q\)

  4. Return “\(\boldsymbol w\)”

Q: How was the formula \(\widehat Q = \lambda E_k + (1-\lambda)\widehat Q\) derived? It is an exponentially weighted moving average: each new loss \(E_k\) enters with weight \(\lambda\), while older contributions decay geometrically, giving a cheap running estimate of \(Q\).
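A minimal sketch of the pseudocode above on a toy squared-error loss (the data and the concrete per-point loss are illustrative assumptions; \(\lambda\) is taken to be a small constant):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))           # toy data (illustrative)
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w
l = len(X)

def L_i(w, i):                          # per-point loss L_i(w)
    return 0.5 * (X[i] @ w - y[i]) ** 2

def grad_L_i(w, i):
    return (X[i] @ w - y[i]) * X[i]

eta, lam = 0.05, 0.01                   # learning rate and "forgetting" rate
w = rng.normal(0, 1, size=3)            # step 1: initialize w from N(0, 1)
Q_hat = np.mean([L_i(w, i) for i in range(l)])    # step 2

for _ in range(5000):                   # step 3: fixed count stands in for a convergence test
    k = rng.integers(l)                 # random index k (0-based here)
    E_k = L_i(w, k)
    w = w - eta * grad_L_i(w, k)        # parameter update
    Q_hat = lam * E_k + (1 - lam) * Q_hat   # moving estimate of Q
```

In practice the loop would terminate once `Q_hat` or `w` stops changing noticeably; a fixed iteration count keeps the sketch short.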


SGD Exercise:#

Given:
Vector \(\boldsymbol w = (w_1, w_2, w_3, w_4, w_5)^{T}\)
Learning rate \(\alpha = 0.6\)
\(k\) is a random variable, where \(k \in [1, 10]\)
\(\nabla L (\boldsymbol w)\) is the gradient of the function with parameter \(\boldsymbol w\)

Mini Batch GD#

Mini batch gradient descent is actually a mixture of Batch Gradient Descent and SGD. It computes the gradient of the function over small random subsets of the elements (mini batches).

Algorithm (Mini Batch GD)

To solve the optimization problem:

\[ L(\boldsymbol w) \to \min\limits_{\boldsymbol w} \]

do the following steps:

  1. Initialization of \(\boldsymbol w\) with some initial values (e.g., from \(\mathcal N(0, 1)\))

  2. Initial calculation of the function: \(\widehat Q (\boldsymbol w) = \frac 1\ell\sum\limits_{i=1}^\ell L_i(\boldsymbol w)\)

  3. While \(\widehat Q\) and/or \(\boldsymbol w\) don’t reach set values DO:

  • Choose a subset of \(m\) random indexes, i.e. \(B = \{k_1, k_2, ..., k_m\}\), where \(1 \leqslant m < \ell\)

  • Calculate function \(E_{B}(\boldsymbol w) = \sum\limits_{i \in B} L_i(\boldsymbol w)\)

  • Recalculate vector: \(\boldsymbol w = \boldsymbol w - \eta \sum\limits_{i \in B} \nabla L_i(\boldsymbol w)\)

  • Recalculate function: \(\widehat Q = \lambda E_{B}(\boldsymbol w) + (1-\lambda)\widehat Q\)

  4. Return \(\boldsymbol w\)
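The core of the loop above can be sketched as follows (toy data and batch size are illustrative; the \(\widehat Q\) moving estimate is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))           # toy data (illustrative)
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)
eta, m = 0.01, 16                       # learning rate and batch size (illustrative)
for _ in range(2000):
    B = rng.choice(len(X), size=m, replace=False)    # random index subset B
    grad_B = X[B].T @ (X[B] @ w - y[B])              # sum of grad L_i over i in B
    w = w - eta * grad_B
```

Batching keeps the per-step cost low while averaging out much of the noise of single-sample SGD.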

Heuristics#

Here, “heuristics” means a set of techniques and methods that facilitate and simplify the solution of the corresponding tasks. Below is a list of heuristics suggested when working with SGD:

  • Initialization of the vector \(\boldsymbol w\)

  • Choice for learning rate \(\eta\)

Initialization of the vector \(\boldsymbol w\)#

One of the crucial steps of the SGD algorithm is the initialization of \(\boldsymbol w\). It is common to use the zero vector as the initial value, but there are other approaches as well.

  1. So, \(\boldsymbol w = 0\)

Another way is to take small random values for \(w\): large initial values can land in a region with small (in absolute value) derivatives, from which it is difficult to escape.

  2. So, \(w_j\) = random(\(-\frac 1{2n}\), \(\frac 1{2n}\)) (the denominator is \(2n\) because in the case of \(n = 1\) the bound would otherwise be 1, which is an inappropriately large value; with \(2n\) it is at most 0.5.)
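The two initialization options above can be sketched as follows (the value of \(n\) is illustrative):

```python
import numpy as np

n = 5                                    # number of parameters (illustrative)
rng = np.random.default_rng(0)

w_zero = np.zeros(n)                     # option 1: w = 0
bound = 1 / (2 * n)                      # option 2: small uniform values
w_small = rng.uniform(-bound, bound, size=n)
```

Both start the descent near the origin; the random variant breaks symmetry between parameters.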

Dynamic Learning Rate \(\eta\)#

As an optimization, the learning rate \(\eta\) can be changed dynamically at each iteration step. Replace \(\eta\) with a time-dependent learning rate \(\eta(t)\). Here is a list of some commonly used formulas for \(\eta(t)\):

\[\eta(t)=\frac 1{t}\]
\[ \eta(t)=\eta_i \;\;\text{ if }\;\; t_i \leqslant t < t_{i+1} \]
\[\eta(t)=\eta_0 \cdot e^{-\lambda t} \;\;\text{(exponential decay)}\]
\[\eta(t)=\eta_0 \cdot (\beta t + 1)^{-\alpha} \;\;\text{(polynomial decay)}\]
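The four schedules above, written out directly (the constants \(\eta_0\), \(\lambda\), \(\beta\), \(\alpha\) and the piecewise boundaries are illustrative assumptions):

```python
import numpy as np

def inverse_time(t):
    return 1.0 / t                               # eta(t) = 1/t

def piecewise(t, boundaries=(0, 100, 200), rates=(0.1, 0.01, 0.001)):
    i = int(np.searchsorted(boundaries, t, side="right")) - 1
    return rates[i]                              # eta_i on [t_i, t_{i+1})

def exponential_decay(t, eta0=0.1, lam=0.01):
    return eta0 * np.exp(-lam * t)               # eta0 * e^(-lam * t)

def polynomial_decay(t, eta0=0.1, beta=0.01, alpha=1.0):
    return eta0 * (beta * t + 1) ** (-alpha)
```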


Note

If \(\eta(t)\) decreases too quickly, we stop optimizing prematurely. If we decrease it too slowly, we waste too much time on optimization.

Advantages and Disadvantages#

As with any algorithm, SGD has its own pros and cons.
Advantages:

  • Simple to implement

  • If the function \(L\) is not differentiable, it can be approximated by a differentiable one

  • Suitable for problems with a huge number of data points

Disadvantages:

  • It can easily get stuck in local optima

  • It fluctuates strongly

Optimizers of Gradient Algorithms#

The main disadvantage of gradient algorithms is getting stuck in local optima (minima).

https://upload.wikimedia.org/wikipedia/commons/thumb/6/68/Extrema_example_original.svg/1200px-Extrema_example_original.svg.png

Moreover, at such points the derivative is zero, so the parameter vector \(\boldsymbol w\) will no longer change according to formula (16) (\(\nabla Q(\boldsymbol w) = \nabla L_k(\boldsymbol w, x_k) = 0\)), and we get stuck.

In addition, stochastic gradient algorithms can fluctuate strongly when moving towards the minimum point of the function.

../../_images/b69ab86a9809ae2247d2e26aeaf640533f9452f83eb437e571ddd9134441b0f1.png

Several solutions have been proposed for the problems above; see below.

Momentum Gradient Descent#

../../_images/momentum-meme.png

It was proposed to average the gradient in steps using the formula:

(17)#\[\boldsymbol v = \gamma \boldsymbol v + (1-\gamma)\eta_t \cdot \nabla Q(\boldsymbol w)\]

where \(\boldsymbol v\) accumulates the average of the gradients, and \(\gamma\) is a parameter that adjusts how much the past gradients contribute to the value of \(\boldsymbol v\).

Note

If we rewrite \((1-\gamma)\eta_t\) from formula {eq}`momentum-formula` by absorbing the factor into the learning rate, \(\eta_t := (1-\gamma)\cdot\eta_t\), we obtain the update: \(\boldsymbol v = \gamma \boldsymbol v + \eta_t \cdot \nabla Q(\boldsymbol w)\)

Finally, just change the \(\boldsymbol w\) using the familiar principle:

(18)#\[\boldsymbol w = \boldsymbol w - \boldsymbol v\]

where, \(\boldsymbol v\) - smoothed gradients
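A minimal sketch of updates (17)–(18) on a toy least-squares objective (the data and the constants \(\gamma\), \(\eta\) are illustrative assumptions; the factor \(1-\gamma\) is absorbed into \(\eta\) as in the note above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # toy data (illustrative)
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

def grad_Q(w):
    return X.T @ (X @ w - y) / len(X)   # mean gradient as grad Q(w)

w = np.zeros(3)
v = np.zeros(3)                         # accumulator of smoothed gradients
gamma, eta = 0.9, 0.1
for _ in range(500):
    v = gamma * v + eta * grad_Q(w)     # formula (17), 1-gamma absorbed into eta
    w = w - v                           # formula (18)
```

The accumulated \(\boldsymbol v\) damps the step-to-step oscillation of the raw gradients.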

NAG - Nesterov’s accelerated gradient#

Taking into account the already mentioned formula in momentum:

\[\boldsymbol v = \gamma \boldsymbol v + \eta_t \cdot \nabla Q(\boldsymbol w)\]

where the factor \(1-\gamma\) is absorbed into \(\eta_t\) (i.e. \(\eta_t := (1-\gamma)\cdot\eta_t\)) for simplicity.

Now, we can represent the “\(\boldsymbol v\)” as the sum of the vectors “\(\gamma \boldsymbol v\)” and “\(\eta_t \cdot \nabla Q(\boldsymbol w)\)”.

../../_images/sgd-momentum.png

But at the time of calculating the gradient, we already know that we will shift by “\(\gamma \boldsymbol v\)”. Therefore, it is more correct to calculate the gradient not at the point \(\boldsymbol w\), but at the point (\(\boldsymbol w - \gamma \boldsymbol v\)):

../../_images/sgd-nesterov.png

Mathematically, it is shown as:

(19)#\[\boldsymbol v = \gamma \boldsymbol v + \eta_t \cdot \nabla Q(\boldsymbol w-\gamma \boldsymbol v)\]

As a rule, this approach shows better convergence to the optimum point.
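The only change relative to the momentum sketch is the point at which the gradient is evaluated, per formula (19) (the toy data and constants are again illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # toy data (illustrative)
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

def grad_Q(w):
    return X.T @ (X @ w - y) / len(X)   # mean gradient as grad Q(w)

w = np.zeros(3)
v = np.zeros(3)
gamma, eta = 0.9, 0.1
for _ in range(500):
    v = gamma * v + eta * grad_Q(w - gamma * v)   # formula (19): look-ahead point
    w = w - v
```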

SGD vs Momentum vs Nesterov Momentum#

Visualization for \(f(x) = x^2 \sin x\)#

../../_images/3595144764983580fb0444a3648a237adc1a6e1347509f263e5604dac5a4fc1c.png
Visualization for \(f(x) = x^2\)#
../../_images/1ef0900fc153d0a9ba8742fb20e2e55c2aa2a7faceea1e663b0269eec6d2f91b.png
import numpy as np
import plotly.graph_objs as go
from plotly.subplots import make_subplots

# Objective function and its gradient
def objective_function(x):
    return x**2

def gradient(x):
    return 2*x

def stochastic_gradient_descent(initial_x, learning_rate, epochs):
    # With a single deterministic 1-D objective there is nothing to sample,
    # so this reduces to plain gradient descent; it is used here only to
    # visualize the descent steps.
    x_values = [initial_x]
    for epoch in range(epochs):
        gradient_val = gradient(x_values[-1])
        update = -learning_rate * gradient_val
        new_x = x_values[-1] + update
        x_values.append(new_x)
    return x_values

def create_animation(x_values, objective_function):
    x_vals = np.linspace(-10, 10, 1000)
    y_vals = objective_function(x_vals)

    fig = make_subplots(rows=1, cols=1)
    scatter = go.Scatter(x=x_vals, y=y_vals, mode='lines', name='Objective Function')
    points = go.Scatter(x=[x_values[0]], y=[objective_function(x_values[0])], mode='markers', name='SGD Steps')
  
    fig.add_trace(scatter)
    fig.add_trace(points)

    frames = [go.Frame(data=[
        go.Scatter(x=x_vals, y=y_vals, mode='lines', name='Objective Function'),
        go.Scatter(x=x_values[:i+1], y=[objective_function(x) for x in x_values[:i+1]],
                   mode='markers', name='SGD Steps')       
    ]) for i in range(1, len(x_values))]

    fig.frames = frames
    return fig

initial_x = 5
learning_rate = 0.1
epochs = 30

x_values = stochastic_gradient_descent(initial_x, learning_rate, epochs)
animation_fig = create_animation(x_values, objective_function)

animation_fig.update_layout(updatemenus=[dict(type='buttons', showactive=False, buttons=[dict(label='Play',
                                 method='animate', args=[None, dict(frame=dict(duration=500, redraw=True),
                                 fromcurrent=True)])])])

animation_fig.show()